Machine Learning With Android and Arduino

Introduction

I collected data on three different surfaces: carpet, cobblestone, and ceramic tile. The goal of the project is to design a classification scheme that identifies the surface from the readings of a tri-axial accelerometer. This project imports the data from the Accel Plot and Arduino project (http://www.instructables.com/id/Realtime-MPU-6050A0-Data-Logging-With-Arduino-and-/), performs some exploratory analysis, and uses decision trees to classify the data.

Instrumentation Setup

I fabricated a shield for the Arduino Uno that includes the HC-05 Bluetooth module and the MPU-6050 accelerometer/gyroscope sensor. Only the accelerometers of the MPU-6050 are used in this project. The figure below shows the shield layout.

[Image: shield layout for the Arduino Uno with the HC-05 and MPU-6050]

I mounted the shield to a small remote-controlled (RC) car, as shown in the photo below. This was a safe platform for testing the data collection while operating the Android device. Channel 1 is the x-axis accelerometer and is positive when the car moves backwards; channel 2 is the y-axis accelerometer and is positive when the car moves towards the driver; channel 3 is the z-axis accelerometer and is positive when the car moves upwards.

[Image: shield mounted on the RC car]

Data Collection

I used the Accel Plot (https://play.google.com/store/apps/details?id=com.dairyroadsolutions.accelplot&hl=en) application to display and collect the data. Each file is 1 minute long (15,000 samples at 250 samples/second). I made two different runs on each surface. The image below shows the three different surfaces.

[Image: the three test surfaces: carpet, cobblestone, and ceramic tile]

Notebook Setup

This line is required to display the plots in the notebook

In [1]:
%matplotlib inline

Definitions and Functions

This value is defined in the Arduino code and documented in "Bluetooth.java" in the application

In [2]:
D_SAMPLING_FREQUENCY = 250.0

Pull in the libraries and define the functions we will be using

In [3]:
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
plt.style.use('ggplot')

This function pulls in the data for each channel from Accel Plot data files

In [4]:
def getChannelData(iChannel, strTestCase):
    cwd = os.getcwd()
    cwd = os.path.join(cwd, 'TrainingData', strTestCase)

    f = []
    for (dirpath, dirnames, filenames) in os.walk(cwd):
        f.extend(filenames)
        break
        
    strFileSearch = 'Trace0' + str(iChannel)
    strFiles = list(filter(lambda x: strFileSearch in x, f))
    
    for idx in range(0, len(strFiles)):
        fh = open(os.path.join(cwd, strFiles[idx]), 'rb')
        # read the big-endian data into numpy
        if(idx==0):
            xEnd = np.fromfile(fh, dtype=('>f'))
        else:
            xEnd = np.append(xEnd, np.fromfile(fh, dtype=('>f')))
        fh.close()
    
    # We have to switch the underlying NumPy array to the native byte order.
    # Great write-up at: http://pandas.pydata.org/pandas-docs/stable/gotchas.html.
    # Without this you get: ValueError: Big-endian buffer not supported on little-endian compiler
    x = xEnd.byteswap().newbyteorder()
    
    return (x, strFiles)
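The byte-order gotcha is easy to demonstrate in isolation. The sketch below (a standalone toy example, not project data) builds a small big-endian array like the ones `np.fromfile` returns and converts it to native order; the values survive, only the in-memory layout changes.

```python
import numpy as np

# Values stored as big-endian 32-bit floats, as in the Accel Plot trace files
raw = np.array([1.5, -2.25, 0.125], dtype='>f4')

# Swap the bytes, then reinterpret the dtype as the opposite byte order.
# (arr.view(arr.dtype.newbyteorder()) is the spelling that also works on
# NumPy 2.x, where ndarray.newbyteorder() was removed.)
native = raw.byteswap().view(raw.dtype.newbyteorder())

# The numeric values are unchanged; only the memory layout differs
print(native.tolist())  # [1.5, -2.25, 0.125]
```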

This function pads both ends of a data frame with the first and last valid samples

In [5]:
def padDataFrame(dfData, idx):

    for column in dfData:
        fTemp = float(dfData[column].iloc[idx])
        dfData[column].iloc[0:idx] = fTemp
        fTemp = float(dfData[column].iloc[-idx - 1])
        dfData[column].iloc[-idx:] = fTemp
    
    return dfData

This function calculates a rolling (moving) mean for the data frame. Pad with the first valid sample.

In [6]:
def getDataFrameRM(dfData, window_size, bRename=True):
    
    # These lines add the rolling average
    dfDataRM = dfData.rolling(window=window_size, center=True).mean()

    if( bRename):
        dfDataRM.columns = dfDataRM.columns+'_rm'
    
    # zero-indexed, no need to add 1
    idx = int(window_size/2)

    # Pad with the closest good value
    dfDataRM = padDataFrame(dfDataRM, idx)

    return dfDataRM

This function calculates the rolling kurtosis

In [7]:
def getDataFrameKR(dfData, window_size):
    
    # These lines add the rolling kurtosis
    dfDataKR = dfData.rolling(window=window_size, center=True).kurt()

    dfDataKR.columns = dfDataKR.columns+'_kr'
    
    # zero-indexed, no need to add 1
    idx = int(window_size/2)

    # Pad with the closest good value
    dfDataKR = padDataFrame(dfDataKR, idx)

    return dfDataKR

This function implements a clever way of calculating the rolling RMS values, from: http://stackoverflow.com/questions/8245687/numpy-root-mean-squared-rms-smoothing-of-a-signal

In [8]:
def getRMS(data, window_size):
    data2 = np.power(data,2)
    window = np.ones(window_size)/float(window_size)
    return np.sqrt(np.convolve(data2, window, 'same'))
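As a sanity check on the convolution trick (my own toy example, not project data): the rolling RMS of a unit-amplitude sine should settle at 1/√2 ≈ 0.707 once the window spans a whole number of cycles.

```python
import numpy as np

def getRMS(data, window_size):
    data2 = np.power(data, 2)
    window = np.ones(window_size) / float(window_size)
    return np.sqrt(np.convolve(data2, window, 'same'))

fs = 250.0                        # matches D_SAMPLING_FREQUENCY
t = np.arange(0, 1.0, 1.0 / fs)   # one second of data
x = np.sin(2 * np.pi * 5.0 * t)   # 5 Hz unit-amplitude sine

rms = getRMS(x, 50)               # 50 samples = exactly one cycle at 5 Hz
print(rms[len(rms) // 2])         # ≈ 0.707 away from the edges
```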

This function returns a new data frame with rolling RMS values

In [9]:
def getDataFrameRMS(dfData, window_size):
    
    dfRMS = dfData.copy(deep=True)
    
    for column in dfData:
        data = np.array(dfData[column])
        dfRMS[column] = getRMS(data, window_size)
        
    dfRMS.columns = dfData.columns+'_rms'
    
    # zero-indexed, no need to add 1
    idx = int(window_size/2)
    
    # Pad with the closest good value
    dfRMS = padDataFrame(dfRMS, idx)
        
    return dfRMS

This function calculates peak values for each column in the entire data frame. The signal processing model includes: remove the mean, rectify the signal, restore the mean, and keep the rolling maximum values.

In [10]:
def getDataFramePk(dfData, window_size, bRollingMeanOffset = True):
    
    # We want to rectify about the mean
    if (bRollingMeanOffset):
        mn = getDataFrameRM(dfData, window_size, bRename=False)
    else:
        mn = dfData.mean()
        
    dfPk = dfData-mn
    dfPk = dfPk.abs()
    dfPk = dfPk+mn
    
    # Rolling maximum
    dfPk = dfPk.rolling(window = window_size, center=True).max()
    
    # correct the column names    
    dfPk.columns = dfPk.columns+'_pk'
    
    # zero-indexed, no need to add 1
    idx = int(window_size/2)
    
    # Pad with the closest good value
    dfPk = padDataFrame(dfPk, idx)
        
    return dfPk

This function calculates the crest factor (ratio of peak to RMS) for each column in a data frame

In [11]:
def getDataFrameCrest(dfData, dfDataPk, dfDataRMS):
    
    dfCrest = dfDataPk.copy(deep=True)
    
    iCol = len(dfDataPk.columns)
    
    for idxCol in range(0, iCol):
        dataPk = np.array(dfDataPk.iloc[:, idxCol])
        dataRMS = np.array(dfDataRMS.iloc[:, idxCol])
        dfCrest.iloc[:, idxCol] = np.divide(dataPk, dataRMS)
        
    dfCrest.columns = dfData.columns+'_cr'
        
    return dfCrest

This function pulls the arrays from the data file function into a single data frame and adds the extracted values

In [12]:
def getDataAsFrame(strTestCase, strClass, bScaleData=True):
    
    # Read the data in
    (x1,strFiles1) = getChannelData(1,strTestCase)
    (x2,strFiles2) = getChannelData(2,strTestCase)
    (x3,strFiles3) = getChannelData(3,strTestCase)
    (x4,strFiles4) = getChannelData(4,strTestCase)
    t = np.arange(len(x1)) / D_SAMPLING_FREQUENCY

    # Construct the data frame
    dfData = pd.DataFrame(data={('t'):t,
                                ('Ch1'):x1, 
                                ('Ch2'):x2, 
                                ('Ch3'):x3, 
                                ('Ch4'):x4,
                                'Surface':strClass,
                                'File':strTestCase})

    data_cols = [col for col in dfData.columns if 'Ch' in col]

    # Rolling average
    window_size_rm = 50
    dfDataRM = getDataFrameRM(dfData[data_cols], window_size_rm, bRename=False)

    # Rolling mean residual (basically a high-pass filtered, high-frequency signal)
    dfDataHF = dfData[data_cols] - dfDataRM
    dfDataHF.columns = dfDataHF.columns+'_hf'
    
    # Peak of the high frequency signal
    window_size_hp_pk = 5
    dfDataHFPk = getDataFramePk(dfDataHF, window_size_hp_pk)

    # Velocity of high frequency signal
    #dfDataVL = (dfData[data_cols] - dfDataRM).cumsum()
    dfDataVL = (dfDataRM-dfDataRM.mean()).cumsum()
    dfDataVL.columns = dfDataVL.columns+"_vl"
    
    # Now that we are through subtracting, rename the rolling mean columns
    dfDataRM.columns = dfDataRM.columns+'_rm'

    # Rolling RMS
    window_size = 11
    dfDataRMS = getDataFrameRMS(dfData[data_cols], window_size)

    # Rolling peak value
    window_size = 25
    dfDataPk = getDataFramePk(dfData[data_cols], window_size)

    # Peak value of the rolling mean
    window_size_rm_pk = window_size_rm*5
    dfDataRM_Pk = getDataFramePk(dfDataRM, window_size_rm_pk, bRollingMeanOffset = True)

    # Aggregate the dataframes
    dfData = pd.concat([dfData, dfDataRM, dfDataRMS, dfDataHF, dfDataHFPk, dfDataPk, dfDataRM_Pk, dfDataVL],
                       axis=1)
    
    return dfData

This function appends one dataframe to another

In [13]:
def appendDataAsFrame(strTestCase, dfData, strClass):
    dfNew = getDataAsFrame(strTestCase, strClass)
    dfDataOut = pd.concat([dfData, dfNew], ignore_index=True)

    return dfDataOut

This function plots out the signal features against the raw data

In [14]:
def plotFeatures(dfDataPlot, strColName):
    
    fig, axes = plt.subplots(nrows=3, ncols=2)
    fig.subplots_adjust(wspace=.5, hspace=0.5)
    
    ax = dfDataPlot.plot(x='t', y=[strColName, strColName+'_rm'], 
                     ax=axes[0,0], legend=True, figsize=(10,10))
    ax.set_xlabel('Time, seconds')
    ax.set_ylabel('Amplitude, ADC counts')
    ax.set_title(strColName+" Rolling Mean")

    ax = dfDataPlot.plot(x='t', y=[strColName, strColName+'_rms'], 
                     ax=axes[0,1], legend=True, figsize=(10,10))
    ax.set_xlabel('Time, seconds')
    ax.set_ylabel('Amplitude, ADC counts')
    ax.set_title(strColName+" RMS")

    ax = dfDataPlot.plot(x='t', y=[strColName, strColName+'_pk'], 
                     ax=axes[1,0], legend=True, figsize=(10,10))
    ax.set_xlabel('Time, seconds')
    ax.set_ylabel('Amplitude, ADC counts')
    ax.set_title(strColName+" Peak")

    ax = dfDataPlot.plot(x='t', y=[strColName+"_rm", strColName+'_rm_pk'], 
                     ax=axes[1,1], legend=True, figsize=(10,10))
    ax.set_xlabel('Time, seconds')
    ax.set_ylabel('Amplitude, ADC counts')
    ax.set_title(strColName+" Peak of Rolling Mean")
    
    ax = dfDataPlot.plot(x='t', y=[strColName+"_hf", strColName+"_hf_pk"], 
                     ax=axes[2,0], legend=True, figsize=(10,10))
    ax.set_xlabel('Time, seconds')
    ax.set_ylabel('Amplitude, ADC counts')
    ax.set_title(strColName+" High-frequency Peak")
    
    ax = dfDataPlot.plot(x='t', y=[strColName+"_vl"], 
                     ax=axes[2,1], legend=True, figsize=(10,10))
    ax.set_xlabel('Time, seconds')
    ax.set_ylabel('Amplitude, ADC counts')
    ax.set_title(strColName+" Velocity")

This function plots the timebase data in the data frame

In [15]:
def plotFolder(dfDataPlot):
    
    fig, axes = plt.subplots(nrows=2, ncols=2)
    fig.subplots_adjust(wspace=.5, hspace=0.5)
    
    strColName = 'Ch1'
    ax = dfDataPlot.plot(x='t', y=[strColName, strColName+'_rm', strColName+'_pk'], 
                     ax=axes[0,0], legend=True, figsize=(10,10))
    ax.set_xlabel('Time, seconds')
    ax.set_ylabel('Amplitude, ADC counts')
    ax.set_title(strColName)

    strColName = 'Ch2'
    ax = dfDataPlot.plot(x='t', y=[strColName, strColName+'_rm', strColName+'_pk'], 
                     ax=axes[0,1], legend=True, figsize=(10,10))
    ax.set_xlabel('Time, seconds')
    ax.set_ylabel('Amplitude, ADC counts')
    ax.set_title(strColName)

    strColName = 'Ch3'
    ax = dfDataPlot.plot(x='t', y=[strColName, strColName+'_rm', strColName+'_pk'], 
                     ax=axes[1,0], legend=True, figsize=(10,10))
    ax.set_xlabel('Time, seconds')
    ax.set_ylabel('Amplitude, ADC counts')
    ax.set_title(strColName)

    strColName = 'Ch4'
    ax = dfDataPlot.plot(x='t', y=[strColName, strColName+'_rm', strColName+'_pk'], 
                     ax=axes[1,1], legend=True, figsize=(10,10))
    ax.set_xlabel('Time, seconds')
    ax.set_ylabel('Amplitude, ADC counts')
    ax.set_title(strColName)

This function plots the histograms of the different classes of data

In [16]:
def plotClasses(dfData, strSuff):
    
    fig, axes = plt.subplots(nrows=2, ncols=2)
    fig.subplots_adjust(wspace=0.5, hspace=0.5)
    
    strClass = ['Cobble', 'Tile', 'Carpet']

    dfData1 = dfData.loc[dfData['Surface'] == strClass[0]]
    dfData1.columns = dfData1.columns+'_'+strClass[0]
    dfData1 = dfData1.reset_index(drop=True)
    dfData2 = dfData.loc[dfData['Surface'] == strClass[1]]
    dfData2.columns = dfData2.columns+'_'+strClass[1]
    dfData2 = dfData2.reset_index(drop=True)
    dfData3 = dfData.loc[dfData['Surface'] == strClass[2]]
    dfData3.columns = dfData3.columns+'_'+strClass[2]
    dfData3 = dfData3.reset_index(drop=True)
    dfDataPlot = pd.concat([dfData1, dfData2, dfData3], axis=1).reindex(dfData1.index)

    strSeries = ['Ch1_' + strSuff + s for s in strClass]
    ax1 = dfDataPlot[strSeries].plot.hist(figsize=(12,12), ax=axes[0, 0], alpha = 0.4, bins=50)

    strSeries = ['Ch2_' + strSuff + s for s in strClass]
    ax1 = dfDataPlot[strSeries].plot.hist(figsize=(12,12), ax=axes[0, 1], alpha = 0.4, bins=50)

    strSeries = ['Ch3_' + strSuff + s for s in strClass]
    ax1 = dfDataPlot[strSeries].plot.hist(figsize=(12,12), ax=axes[1, 0], alpha = 0.4, bins=50)

    #strSeries = ['Ch4_' + strSuff + s for s in strClass]
    #ax1 = dfDataPlot[strSeries].plot.hist(figsize=(12,12), ax=axes[1, 1], alpha = 0.4, bins=50)

This function plots correlation by channel

In [17]:
def plotCorrChannel(strChannel):
    
    # Section the data and calc correlation matrix
    plot_cols = [col for col in dfData.columns if strChannel in col]
    dfPlot = dfData[plot_cols]
    correlations = dfPlot.corr()
    names = list(dfPlot)
    iCols = len(dfPlot.columns)
    
    # plot correlation matrix
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(111)
    cax = ax.matshow(correlations, vmin=-1, vmax=1)
    fig.colorbar(cax)
    ticks = np.arange(0,iCols,1)
    ax.set_xticks(ticks)
    ax.set_yticks(ticks)
    ax.set_xticklabels(names)
    ax.set_yticklabels(names)
    plt.show()

Check the correlation of the tri-axial accel

In [18]:
def plotCorrAccel(dfData):
    
    # Section the data and calc correlation matrix
    plot_cols = ['Ch1', 'Ch2', 'Ch3']
    dfPlot = dfData[plot_cols]
    correlations = dfPlot.corr()
    names = list(dfPlot)
    iCols = len(dfPlot.columns)
    
    # plot correlation matrix
    fig = plt.figure(figsize=(5,5))
    ax = fig.add_subplot(111)
    cax = ax.matshow(correlations, vmin=-1, vmax=1)
    fig.colorbar(cax)
    ticks = np.arange(0,iCols,1)
    ax.set_xticks(ticks)
    ax.set_yticks(ticks)
    ax.set_xticklabels(names)
    ax.set_yticklabels(names)
    plt.show()

This function plots the features for the different surfaces

In [19]:
def plotCorrFeature(dfData, strFeature):
    plot_cols=['Tile', 'Carpet', 'Cobble']
    
    
    dfPlot = pd.DataFrame(data={(strFeature+'_'+plot_cols[0]):np.array(dfData[strFeature].loc[dfData['Surface'] == plot_cols[0]]),
                                (strFeature+'_'+plot_cols[1]):np.array(dfData[strFeature].loc[dfData['Surface'] == plot_cols[1]]), 
                                (strFeature+'_'+plot_cols[2]):np.array(dfData[strFeature].loc[dfData['Surface'] == plot_cols[2]])})
    
    
    #fig0 = plt.figure(figsize=(10,10))
    #plt.plot(dfPlot[strFeature+'_'+plot_cols[0]])
    
    correlations = dfPlot.corr()
    names = list(dfPlot)
    iCols = len(dfPlot.columns)
    
    # plot correlation matrix
    fig = plt.figure(figsize=(5,5))
    ax = fig.add_subplot(111)
    cax = ax.matshow(correlations, vmin=-1, vmax=1)
    fig.colorbar(cax)
    ticks = np.arange(0,iCols,1)
    ax.set_xticks(ticks)
    ax.set_yticks(ticks)
    ax.set_xticklabels(names)
    ax.set_yticklabels(names)
    plt.show()    

Exploratory Analysis

Begin by getting the data into Python. Later this data will be split into two independent sets: one for training and one for testing. The functions also build the features (rolling mean, RMS, peak, etc.) from the raw signal.

In [20]:
strClass = 'Cobble'
strFolder = 'Cobble1'
dfData = getDataAsFrame(strFolder, strClass)
strFolder = 'Cobble2'
dfData = appendDataAsFrame(strFolder, dfData, strClass)

strClass = 'Carpet'
strFolder = 'Carpet1'
dfData = appendDataAsFrame(strFolder, dfData, strClass)
strFolder = 'Carpet2'
dfData = appendDataAsFrame(strFolder, dfData, strClass)

strClass = 'Tile'
strFolder = 'Tile1'
dfData = appendDataAsFrame(strFolder, dfData, strClass)
strFolder = 'Tile2'
dfData = appendDataAsFrame(strFolder, dfData, strClass)

We'll begin by plotting our features against the raw data to be sure the algorithms are working correctly.

In [21]:
plotFeatures(dfData.loc[dfData['Surface'] == 'Carpet'].iloc[:750], strColName="Ch1")
In [22]:
plotFeatures(dfData.loc[dfData['Surface'] == 'Carpet'].tail(750), strColName="Ch1")

The remaining plots and analysis will be simplified if the data is normalized

In [23]:
data_cols = [col for col in dfData.columns if 'Ch' in col]
dfMean = dfData[data_cols].mean()
dfData[data_cols] = dfData[data_cols]-dfMean
dfStd = dfData[data_cols].std()
dfData[data_cols] = dfData[data_cols]/dfStd
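A quick standalone check of this normalization (toy columns, not the notebook's data): after subtracting the mean and dividing by the standard deviation, every column should come out with mean ≈ 0 and standard deviation ≈ 1.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the accelerometer columns
df = pd.DataFrame({'Ch1': np.random.randn(1000) * 3.0 + 7.0,
                   'Ch2': np.random.randn(1000) * 0.5 - 2.0})

# Same recipe as the notebook cell: subtract the mean, divide by the std
df_norm = (df - df.mean()) / df.std()

print(df_norm.mean().round(6).tolist())  # each ≈ 0.0
print(df_norm.std().round(6).tolist())   # each ≈ 1.0
```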

The three accelerometers on the MPU-6050 are physically orthogonal. If they are operating correctly (low noise and low cross-axis sensitivity), they should be informationally orthogonal as well. To check this, the three channels will be plotted on a correlation plot. Everything off the diagonal should be uncorrelated.
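The idea can be illustrated with synthetic data (a made-up example, not MPU-6050 readings): three independent noise channels produce a correlation matrix with ones on the diagonal and near-zero values everywhere else.

```python
import numpy as np
import pandas as pd

# Three statistically independent "channels"
rng = np.random.RandomState(0)
df = pd.DataFrame({'Ch1': rng.randn(10000),
                   'Ch2': rng.randn(10000),
                   'Ch3': rng.randn(10000)})

corr = df.corr()
print(corr.values.diagonal())                 # exactly 1.0 on the diagonal
print(np.abs(corr.values - np.eye(3)).max())  # off-diagonal terms are small
```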

In [24]:
plotCorrAccel(dfData)

The signal features look like they are being extracted correctly; let's see how they relate to each other. The raw signal, "Ch1", is uncorrelated with nearly all of the signal features except the high-frequency signal "Ch1_hf". Although not plotted, "Ch2" and "Ch3" behave similarly.

In [25]:
strChannel = 'Ch1'
plotCorrChannel(strChannel)

Plot out the carpet data. This surface is compliant, so it should have the lowest overall amplitudes. I also plotted the light sensor data, but it is not used in the classification. Each plot shows the acquired data (red line), the rolling mean (blue line), and the peak value (purple line)

In [26]:
plotFolder(dfData.loc[dfData['File'] == 'Carpet1'])
plt.savefig("Carpet1Timebase.pdf", format='pdf')

Now plot the data from the tile. This surface is harder so there should be higher amplitudes, especially in the vertical direction (channel 3, z-axis)

In [27]:
plotFolder(dfData.loc[dfData['File'] == 'Tile1'])
plt.savefig("Tile1Timebase.pdf", format='pdf')

Now plot the data from the cobblestone. This surface is both hard and rough, so it has the highest amplitudes.

In [28]:
plotFolder(dfData.loc[dfData['File'] == 'Cobble1'])
plt.savefig("Cobble1Timebase.pdf", format='pdf')

Next, the distributions of the features are plotted on top of each other. In an ideal world, the three cases (tile, carpet, and cobblestone) would be far apart. In the real world it won't be that clear. This section shows the histograms for the raw signal. The accelerometer signals are all right on top of each other. Ch4 is the light sensor, which doesn't really play a role in this project.

In [29]:
strSuff = ''
plotClasses(dfData, strSuff)

This is the rolling mean ('rm') data. The cobble histogram is moving away from the tile and carpet, but not enough.

In [30]:
strSuff = 'rm_'
plotClasses(dfData, strSuff)

RMS is getting better separation, but that darn tile and carpet are still together.

In [31]:
strSuff = 'rms_'
plotClasses(dfData, strSuff)

Same for the high frequency signal

In [32]:
strSuff = 'hf_'
plotClasses(dfData, strSuff)

Not great separation for the peak value of the high frequency signal

In [33]:
strSuff = 'hf_pk_'
plotClasses(dfData, strSuff)

With the peak feature, there is finally some separation of the tile and carpet.

In [34]:
strSuff = 'pk_'
plotClasses(dfData, strSuff)

The peak value of the rolling mean gives us the best separation so far.

In [35]:
strSuff = 'rm_pk_'
plotClasses(dfData, strSuff)

The velocity also starts to separate out. There is still some overlap in the carpet and tile, but overall the three distributions can be seen

In [36]:
strSuff = 'vl_'
plotClasses(dfData, strSuff)

Make sure the interesting features are uncorrelated

In [37]:
plotCorrFeature(dfData, 'Ch1_rm_pk')
In [38]:
from sklearn import svm, datasets
import sklearn.ensemble as ens
import sklearn.tree as trees
import pydotplus as pdot
from sklearn.metrics import confusion_matrix
from IPython.display import Image

This function defines a nice way to visualize the confusion matrix. From scikit-learn documentation: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

In [39]:
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

This section selects the data and randomizes the order. I didn't normalize the data (subtract the mean and divide by the standard deviation), but that might improve the model.

In [64]:
# select our data
lstFeatures = ['Ch2_vl','Ch3_vl']
lstClassifiers = ['Tile','Cobble','Carpet']
In [65]:
# For testing of the code we want the same seed
RANDOM_SEED = 10234

# Shuffle the data
idx = np.arange(dfData.shape[0])
np.random.seed(RANDOM_SEED)
np.random.shuffle(idx)
dfDataTrain = dfData.iloc[idx[0:(idx.size // 2)]]
dfDataTest = dfData.iloc[idx[(idx.size // 2):idx.size]]
In [66]:
# Break the data into features and classes
TrainFeatures = dfDataTrain[lstFeatures]
TrainClasses = dfDataTrain['Surface']
TestFeatures = dfDataTest[lstFeatures]
TestClasses = dfDataTest['Surface']
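For reference, the same reproducible 50/50 split can be had from scikit-learn's `train_test_split` (a sketch on a toy stand-in frame; the notebook's manual shuffle is equivalent):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for dfData with the two velocity features and the class label
df = pd.DataFrame({'Ch2_vl': np.random.randn(100),
                   'Ch3_vl': np.random.randn(100),
                   'Surface': ['Tile', 'Cobble', 'Carpet', 'Tile'] * 25})

# Fixed random_state plays the role of RANDOM_SEED above
train, test = train_test_split(df, test_size=0.5, random_state=10234)
print(len(train), len(test))  # 50 50
```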

In this section we will fit the data and plot the model. We are going to use "Decision Trees" because they are simple and perform well with non-linear data (refer to http://scikit-learn.org/stable/modules/tree.html#tree for more details on decision trees). We sacrifice some accuracy in the model, but I wanted something that could be implemented in the Arduino.

We are going to start with a simple tree, only 2 levels deep, since that could very likely be implemented on the Arduino
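A depth-2 tree is attractive because it boils down to two levels of comparisons. As a sketch of what the eventual Arduino port might look like (the thresholds and the helper name below are made up for illustration, not the fitted values from the model):

```python
def classifySurface(ch2_vl, ch3_vl, t0=0.5, t1=-0.2, t2=1.1):
    # Hypothetical split thresholds t0..t2 stand in for the fitted ones;
    # a real port would copy the values from the exported tree diagram.
    if ch3_vl <= t0:
        if ch2_vl <= t1:
            return 'Carpet'
        return 'Tile'
    if ch2_vl <= t2:
        return 'Tile'
    return 'Cobble'

print(classifySurface(-1.0, 0.0))  # Carpet
print(classifySurface(2.0, 2.0))   # Cobble
```

On the Arduino side this would translate line-for-line into nested if/else statements in C.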

In [67]:
# Configure the decision tree and perform the fit
tree_depth=2
dtTrain = trees.DecisionTreeClassifier(max_depth=tree_depth)
dtTrain = dtTrain.fit(TrainFeatures, TrainClasses)
In [68]:
# If you haven't installed "graphviz" you will need to do that.
# I had to infer the ordering of the classifiers and manually
# input them. There must be a better way.
dotData = trees.export_graphviz(dtTrain, out_file='None.dot',
                         feature_names=lstFeatures,  
                         class_names=[lstClassifiers[2], lstClassifiers[1], lstClassifiers[0]],  
                         filled=True, rounded=True,  
                         special_characters=True)
dtGraph = pdot.graph_from_dot_file('None.dot')
Image(dtGraph.create_png())
Out[68]:

The model results can also be summarized in something called a confusion matrix. The matrix has the correct results on the diagonals so you want these values to be high. For our fit, the diagonals are high and the off-diagonal terms are small. This is not a bad model so far.
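To make the matrix concrete, here is a five-sample toy example (made-up labels, not project data). Note that scikit-learn's documented argument order is `confusion_matrix(y_true, y_pred)`, with the true labels along the rows.

```python
from sklearn.metrics import confusion_matrix

# Five true labels vs. five predictions, with one Tile/Cobble mix-up
true = ['Tile', 'Tile', 'Cobble', 'Carpet', 'Carpet']
pred = ['Tile', 'Cobble', 'Cobble', 'Carpet', 'Carpet']

cm = confusion_matrix(true, pred, labels=['Tile', 'Cobble', 'Carpet'])
print(cm)
# [[1 1 0]
#  [0 1 0]
#  [0 0 2]]
```

The four correct classifications land on the diagonal; the single misclassified Tile shows up in the off-diagonal Tile/Cobble cell.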

In [69]:
predTrainClasses = dtTrain.predict(TrainFeatures)
cnm = confusion_matrix(predTrainClasses, TrainClasses, labels=lstClassifiers)
plot_confusion_matrix(cnm, lstClassifiers)
Confusion matrix, without normalization
[[11420    98     0]
 [    0 14265     0]
 [ 3637   600 14980]]

The better indicator of performance will be on the test data. We still have high values in the diagonal terms, but there are also higher values in the off-diagonal terms.

In [46]:
predTestClasses = dtTrain.predict(TestFeatures)
cnm = confusion_matrix(predTestClasses, TestClasses, labels=lstClassifiers)
plot_confusion_matrix(cnm, lstClassifiers)
Confusion matrix, without normalization
[[11280   117     0]
 [    0 14341     0]
 [ 3663   579 15020]]

Plot the data out. When possible, I like to see how the model and data interact. One way to do this is put the model on a contour plot and overlay it with the data. That's what we are going to do in this section

In [47]:
# mesh for the features
h=0.01
x_min, x_max = TrainFeatures[lstFeatures[0]].min() - 1, TrainFeatures[lstFeatures[0]].max() + 1
y_min, y_max = TrainFeatures[lstFeatures[1]].min() - 1, TrainFeatures[lstFeatures[1]].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
In [48]:
plot_colors = "ryb"
cmapIn = plt.cm.RdYlBu
iClassifiers = len(lstClassifiers)

plt.figure(num=None, figsize=(6, 6))
Z = dtTrain.predict(np.c_[xx.ravel(), yy.ravel()])
dctTemp = {lstClassifiers[0]:0, lstClassifiers[1]:1, lstClassifiers[2]:2}
Zt = np.zeros_like(Z)
for idx in range(0,len(Z)):
    Zt[idx] = dctTemp[Z[idx]]
Zt = Zt.reshape(xx.shape)
cs = plt.contourf(xx, yy, Zt, cmap=cmapIn, alpha=0.4)

for i, cIn in zip(range(iClassifiers), plot_colors):
    idx = np.where(TrainClasses==lstClassifiers[i])[0]
    plt.scatter(TrainFeatures.iloc[idx[1:3500],0], TrainFeatures.iloc[idx[1:3500],1], label=lstClassifiers[i], cmap=cmapIn, 
                c=cIn, edgecolor='black', s=100)
plt.legend()
plt.xlabel(lstFeatures[0])
plt.ylabel(lstFeatures[1])
plt.savefig("Scatter.pdf", format='pdf')

It is tempting to try to improve the model by adding complexity. So let's try an experiment and add several layers to our decision tree. We'll begin by building another model and testing it.

In [78]:
# Configure the decision tree and perform the fit
tree_depth=10
dtTrain3 = trees.DecisionTreeClassifier(max_depth=tree_depth)
dtTrain3 = dtTrain3.fit(TrainFeatures, TrainClasses)
In [79]:
dotData3 = trees.export_graphviz(dtTrain3, out_file='None3.dot',
                         feature_names=lstFeatures,  
                         class_names=[lstClassifiers[2], lstClassifiers[1], lstClassifiers[0]],  
                         filled=True, rounded=True,  
                         special_characters=True)
dtGraph3 = pdot.graph_from_dot_file('None3.dot')
Image(dtGraph3.create_png())
Out[79]:

That's quite a tree! It would be difficult to implement on an Arduino Uno. How does the performance look? We'll begin by looking at the confusion matrix for the training data.

In [82]:
predTrainClasses3 = dtTrain3.predict(TrainFeatures)
cnm= confusion_matrix(predTrainClasses3, TrainClasses, labels=lstClassifiers)
plot_confusion_matrix(cnm, lstClassifiers)
Confusion matrix, without normalization
[[14447   160   201]
 [   27 14628     0]
 [  583   175 14779]]

The model seems to be improved. As mentioned before, the real proof will be on the test data

In [81]:
predTestClasses3 = dtTrain3.predict(TestFeatures)
cnm= confusion_matrix(predTestClasses3, TestClasses, labels=lstClassifiers)
plot_confusion_matrix(cnm, lstClassifiers)
Confusion matrix, without normalization
[[14259   198   210]
 [   57 14683     0]
 [  627   156 14810]]

The additional layers did improve the model, but it is now very complex and difficult to implement on an Arduino

Conclusion

It looks like we have a valid model for road surface detection. In a future step of the project, we will implement the model on the Arduino and take the car for a drive to see how well the Arduino can tell what the surface is.
